Skip to content

fix: propagate hard ingest failures in all modalities, send only inserted records#230

Merged
shujaatTracebloc merged 1 commit into
developfrom
fix/modality-failure-propagation
Jun 12, 2026
Merged

fix: propagate hard ingest failures in all modalities, send only inserted records#230
shujaatTracebloc merged 1 commit into
developfrom
fix/modality-failure-propagation

Conversation

@shujaatTracebloc

@shujaatTracebloc shujaatTracebloc commented Jun 11, 2026

Copy link
Copy Markdown
Contributor

Problem

Follow-up to #223 (batch API-send failures counted as failed records, non-zero exit), which was only verified end-to-end for token_classification. An audit of the other 10 modalities confirmed the #223 mechanism itself covers all of them — every template routes through BaseIngestor._flush_batchAPIClient.send_batch, with no modality-specific bypass path — but found two real bugs of the same silent-success class, plus a diagnostics gap.

Fixes

1. Five templates swallowed exceptions → exit 0 on hard failure

image_classification, tabular_classification, tabular_regression, time_series_forecasting, time_to_event_prediction had:

except Exception as e:
    logger.error(f"{str(e)}")   # no raise

Any exception escaping ingest() — validation failure, DB error, or the fail-loud backend-registration RuntimeErrors from base.py (rejected edge-label / global-meta / prepare) — was logged and dropped, so the process exited 0 and the K8s Job was marked Succeeded. Notably, the "No data found for table name" registration failure from the original incident would still have exited 0 in these five modalities even after #223, because it surfaces as an exception, not as failed records. (#223's sys.exit(1) only fires when ingest() returns failed records; SystemExit bypasses the handler.) Each handler now re-raises, matching the other six templates.

2. Mid-batch DB failure sent the wrong records to the API

_process_batch paired the send with zip(ids, batch) — positional, truncating to len(ids). insert_batch's per-record fallback returns success IDs in scan order, so after a mid-batch DB failure the API send included the DB-failed record (a phantom backend record pointing at no MySQL row) and dropped the last inserted one (a committed row the platform never sees). The send list is now the batch minus DB failures, matched by data_id — the same rule _flush_batch already uses for failure accounting.

3. Registration-call diagnostics truncated at 100 chars

send_global_meta_meta, send_generate_edge_label_meta, prepare_dataset, create_dataset still had the pre-#223 str(e)[:100] pattern with no response= attached to the manually-raised HTTPError — a DRF 400 body was cut right after "HTTP 400: ". Mirrored the #223 send_batch fix onto all four: response attached, handler logs HTTP <status>: <body> capped at 2,000 chars. Failure propagation is unchanged (these already returned False / raised).

Tests

tests/test_modality_failure_accounting.py (39 tests, additive only — no existing test touched):

  • api-send-failure and clean-run accounting parameterized across all 11 template categories (the fix: count batch API-send failures as failed records, exit non-zero #223 tests ran category=None, which skips the file-transfer branch the 8 file-bearing categories take)
  • AST structural check that every template's except Exception handler re-raises
  • mid-batch mis-send regression guard (asserts the send carries exactly the inserted records' data_ids)
  • 400-diagnostics guards for the four client methods
  • guard that create_dataset's raise escapes ingest() (it's the one registration call whose return value isn't checked)

Full suite: 955 passed, 1 xfailed, coverage 97% (gate: 95%).

Not included (flagged for follow-up discussion)

  • JSONIngestor silently skips type-invalid records (CLI JSON runs exit 0 despite drops) — skip-vs-fail is a semantics decision
  • templates pass on_bad_lines: "warn" overriding the framework default "error"; file-bearing categories don't run DataValidator on the labels CSV, so a ragged row is silently dropped

🤖 Generated with Claude Code


Note

Medium Risk
Changes failure propagation and batch API payload selection in the core ingest path; incorrect pairing could still cause data/platform skew, but behavior aligns with existing _flush_batch accounting and is heavily regression-tested.

Overview
Extends the #223 “no silent success” behavior across all ingestion modalities so K8s Jobs fail when ingestion truly breaks.

Five modality templates (image_classification, tabular variants, time series, time-to-event) no longer log-and-swallow in except Exception; they re-raise after logging so validation/DB/backend registration errors exit non-zero (previously exit 0 / Job Succeeded).

_process_batch now builds the API payload from records that actually inserted, filtering out DB failures by data_id instead of positional zip(ids, batch)—fixing phantom backend rows and dropped committed rows on mid-batch DB errors.

APIClient registration helpers (send_global_meta_meta, edge labels, prepare_dataset, create_dataset) mirror #223’s send_batch fix: attach response= on raised HTTPError and log HTTP <status>: <body> (up to 2k chars) instead of str(e)[:100].

Adds tests/test_modality_failure_accounting.py: per-category API failure accounting, AST check that all templates re-raise, mid-batch DB/API pairing regression, and registration error logging guards.

Reviewed by Cursor Bugbot for commit 00c122e. Bugbot is set up for automated code reviews on this repo. Configure here.

…rted records

Follow-up to #223, which fixed batch API-send failure accounting but was
only verified end-to-end for token_classification. Auditing the other 10
modalities found two real bugs of the same silent-success class plus a
diagnostics gap:

1. Five templates (image_classification, tabular_classification,
   tabular_regression, time_series_forecasting, time_to_event_prediction)
   logged and SWALLOWED exceptions in main()'s except-Exception handler.
   Any exception escaping ingest() — validation failure, DB error, or the
   fail-loud backend-registration RuntimeErrors from base.py — ended with
   exit code 0 and a K8s Job marked Succeeded. The "No data found for
   table name" cascade from the original incident would still have exited
   0 in these five modalities even after #223, because it surfaces as an
   exception, not as failed records. Each handler now re-raises (matching
   the other six templates).

2. _process_batch paired the API send with zip(ids, batch), which pairs
   positionally and truncates to len(ids). After a MID-batch DB failure
   (insert_batch's per-record fallback appends successes in scan order)
   the DB-failed record was sent to the API — a phantom backend record
   pointing at no MySQL row — while the last inserted record was never
   sent. Now the send list is the batch minus DB failures, matched by
   data_id, the same rule _flush_batch already uses for accounting.

3. send_global_meta_meta, send_generate_edge_label_meta, prepare_dataset
   and create_dataset still had the pre-#223 str(e)[:100] truncation with
   no response attached — the backend's field error explaining a 400 was
   cut right after "HTTP 400: ". Mirrored the send_batch fix: response
   attached to the raised HTTPError, handler logs HTTP <status>: <body>
   capped at 2,000 chars.

tests/test_modality_failure_accounting.py (39 tests, additive only):
api-send-failure + clean-run accounting parameterized across ALL 11
template categories (the #223 tests ran category=None, skipping the
file-transfer branch), an AST structural check that every template's
except-Exception handler re-raises, a mid-batch mis-send regression
guard, diagnostics guards for the four client methods, and a guard that
create_dataset's raise escapes ingest().

Full suite: 955 passed, 1 xfailed, coverage 97% (gate: 95%).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@shujaatTracebloc shujaatTracebloc self-assigned this Jun 11, 2026
@saadqbal saadqbal self-requested a review June 12, 2026 07:06
@shujaatTracebloc shujaatTracebloc merged commit a5c1adb into develop Jun 12, 2026
7 checks passed
@shujaatTracebloc shujaatTracebloc deleted the fix/modality-failure-propagation branch June 12, 2026 07:08
Comment thread templates/image_classification/image_classification.py
shujaatTracebloc added a commit that referenced this pull request Jun 12, 2026
…#231)

Addresses the review on #230: the ~20-line run-and-report block (run
ingest(), log each failed record, sys.exit(1) on failures, re-raise hard
errors) was inlined in all 11 template scripts. The hand-maintained
copies had already drifted — five swallowed exceptions and exited 0
until #230, and the failed-record log line was wrong in three different
ways (wrapper-level .get("filename") logged None for MLM/text; the
tabular templates' .get("name") always logged Unknown).

The contract now lives once in
tracebloc_ingestor/utils/template_runner.py, exported from the package
root; each template's main() builds its modality-specific ingestor and
calls run_ingestion(ingestor, config.LABEL_FILE, ...). The helper also
fixes the failed-record identifier: filename where the category has one,
else data_id, else Unknown.

The two structural template tests are migrated to pin the new invariant
(every template imports + calls run_ingestion; any reintroduced
except-Exception/bare-except in main() must re-raise) — their old
assertions matched the inlined source pattern this change removes. The
behavioral guarantees they pinned move to direct unit tests in
tests/test_template_runner.py (SystemExit(1) on failed records,
re-raise on hard errors, SystemExit passes the handler untouched,
context-manager unwinding, identifier fallback chain).

Full suite: 962 passed, 1 xfailed; template_runner.py at 100% coverage,
total 97% (gate: 95%).

Co-authored-by: Claude Fable 5 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants